CV4Edu - Computer Vision for Education

Computer vision (CV) plays a central role in multimodal human-centered AI, yet most models are trained on web-scale benchmarks that poorly reflect real classrooms. Educational data are noisy, private, small-scale, and multimodal (e.g., video, audio, text). Students’ cognitive/behavioral states (e.g., engagement, mind-wandering) and learning processes (e.g., self-regulation, collaboration) can be inferred from subtle multimodal cues (e.g., gaze, pose, facial features). Still, today’s models struggle to generalize to classroom data, limiting reliability in deployed human-centered applications (e.g., assistive technology, collaborative AI). CV4Edu brings together researchers in computer vision, natural language processing, human-computer interaction, and education to chart a community agenda for efficient, privacy-aware, data-driven multimodal models that are more reliable in low-resource, real-world classroom settings, potentially launching shared datasets, metrics, and unified practices.

Our goal is to support research that bridges the CV, NLP, HCI, cognitive science, and learning sciences/education communities. We welcome submissions both within and beyond education contexts, such as multimodal modeling, sensing, behavior forecasting, cognitive state inference, robotics, and embodied AI, provided they discuss transferability to classroom settings (e.g., what may break or carry over under noise, occlusions, viewpoints, multi-person dynamics, privacy constraints, limited annotations, distribution shift, hardware variability).

Topics

The workshop topics include (but are not limited to):

Multimodal classroom perception
  • Face, gaze, pose, gesture, posture, affect, and prosody
  • Video, audio, gaze sensors, and wearables (egocentric and exocentric)
  • Multimodal fusion, representation learning, and cross-view / multi-camera setups
Language-centered multimodal learning analytics
  • Linking speech/text to video events, gaze/attention, and instructional context
  • Classroom NLP: ASR robustness, diarization, bias evaluation and mitigation, discourse modeling, dialogue/tutoring interactions, simplification, misconception detection
  • Retrieval-augmented classroom analytics, model adaptation, and evaluation for learning-aligned outcomes
Robustness & generalization
  • Domain shift beyond the lab, occlusions, noisy data, and missing modalities
  • Few-/low-shot learning, continual and on-device adaptation
  • Generalization across classroom layouts and populations
Human behavior modeling for learning
  • Engagement, attention, affect, confusion, self-regulation, and metacognition
  • Collaboration, group dynamics, and teacher–student interactions
  • Gaze-informed models, saliency/scanpath prediction, activity recognition
Temporal modeling & intervention
  • Sequential/temporal models of learning processes
  • Behavioral forecasting, early-warning systems, and interventions
  • Real-time inference, feedback, and human-in-the-loop systems
Interpretability, reliability & evaluation
  • Interpretable models, uncertainty estimation, and calibration
  • OOD detection, fairness, and bias analysis
  • Evaluation protocols aligned with learning outcomes
Privacy-aware AI, datasets & deployments
  • Privacy-preserving data collection, anonymization, de-identification, and governance
  • Annotation strategies, construct-aligned labeling, active learning, synthetic data, and dataset curation
  • Classroom-ready systems, scalable multimodal data-collection frameworks, edge/on-device inference, and real-world deployments

We encourage general computer-vision, visually grounded NLP, and human-centered, collaborative AI submissions (e.g., behavioral modeling, pose/activity recognition, gaze estimation, attention modeling, multimodal learning, methods “in the wild”, cognitive state inference and forecasting) that make a clear connection to educational/learning environments, even if that connection is made primarily in the discussion section.

Accepted Papers

Archival Papers
1. Mahsa Ardakani, Arshia Eslami, Ramtin Zand
VLMath: A Multimodal Vision-Language System for Pedagogically Aligned Math Tutoring
2. Ahmed Abdelkawy, Ahmed Elsayed, Asem Ali, Aly Farag, Thomas Tretter, Michael McIntyre
Context Matters: Peer-Aware Student Behavioral Engagement Measurement via VLM Action Parsing and LLM Sequence Classification
3. Ziwei Zhao, Xizi Wang, Yuchen Wang, Feng Cheng, David J. Crandall
Sequence-Based Identification of First-Person Camera Wearers in Third-Person Views
4. Wen-Hsin Tsai, Chia-Ming Lee, Yuk-Ying Tung
Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning
5. Ethan Seefried, Changsoo Jung, Videep Venkatesha, Trevor Chartier, Caleb Christian, Jack Fitzgerald, Mariah Bradford, Sifatul Anindho, Matthew Sturgeon, Nathaniel Blanchard
CSU101: An Educational Dataset for Introductory Computer Vision
6. Martyna Gruszka, Risa Shinoda, Taiki Miyanishi, Takumi Hirose, Nakamasa Inoue
MES-Bench: A Benchmark for Multimodal Elaborative Simplification and Comprehensibility Evaluation in Language Learning
7. Muhammad Rafsan Kabir, Md Shopon, Marina Gavrilova
ReSoFed: Reliability-Guided Model Souping for Robust Federated Learning in Heterogeneous Classroom Environments
8. Xiao Wang, Lu Dong, Ifeoma Nwogu, Srirangaraj Setlur, Venu Govindaraju
InterventionLens: A Multi-Agent Framework for Detecting ASD Intervention Strategies in Parent-Child Shared Reading
9. Chongyu He, Peter Youngs, Scott Acton
Delta-Gated Incremental Multi-Forward-Pass Modeling for Robust Multimodal Classroom Video Understanding
10. Hanchen David Wang, Yilin Liu, Madison Mason, Surya Rayala, Gautam Biswas, Daniel Levin, Meiyi Ma
AI-Assisted Competency Assessment from Egocentric Video in Simulation-Based Nursing Education
11. Lu Dong, Xiao Wang, Mark Frank, Srirangaraj Setlur, Venu Govindaraju, Ifeoma Nwogu
ConfusionBench: An Expert-Validated Benchmark for Confusion Recognition and Localization in Educational Videos
12. Sifatul Anindho, Videep Venkatesha, Nathaniel Blanchard
Evaluating Web-trained Facial Expression Recognition in Naturalistic Collaborative Learning
13. Yuji Zhang, Duo Zhou, Bo Chen, Adi Chalasani, Noah Schroeder, H Chad Lane, ChengXiang Zhai
Scaffolding Human Learning by Shaping Visual Environment
14. Ekta Sood, Sebastian Ricke, Trisha Mittal, Sidney K. D'Mello
From Emotion Recognition to Mind-Wandering Detection: A Comparative Analysis of Video-Based Emotion Foundation Models
15. Ashwin T S, Srigowri Mayasandra Prasanna, Joyce Horn Fonteles, Gautam Biswas
Do Emotion Recognition Models Generalize to Classrooms? Robustness and Fairness Analysis
16. Divya Mereddy, Ashwin T S, Marcos Quinones Grueiro, Gautam Biswas
Diagnosis of Human–Object Interaction Detectors for Real-World Educational Applications
17. Suraj Prasad, Pinak Mahapatra
Speech-Synchronized Whiteboard Generation via VLM-Driven Structured Drawing Representations
Non-Archival Papers
1. Yanzhe Chen, Kevin Qinghong Lin, Mike Zheng Shou
Code2Video: A Code-centric Paradigm for Educational Video Creation
2. Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
Paper2Video: Automatic Video Generation from Scientific Papers
3. Pinak Mahapatra, Suraj Prasad
From Knowing to Drawing: Teaching Vision-Language Models Spatial Programs for Educational Diagrams via Visual RL
4. Twumasi Mensah-Boateng, Anirban Roy, Marta K. Mielicki, Nonye M Alozie, Ramneet Kaur, Jing Yuan
STARVIS: A Multimodal Framework for Standards-Aligned Video Classification in Elementary Science Education
5. Aayam Bansal
Sketch2Feedback: A Grammar-in-the-Loop Framework for Rubric-Aligned Feedback on Student STEM Diagrams
6. Aman Goyal, Kshama Nitin Shah
Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and Cross-Dataset Generalization
7. Aryan Kashyap Naveen, Bhuvanesh Singla, Raajan R Wankhade, Shreesha M, Ramu S, Ram Mohana Reddy Guddeti
AutoOEP: A Multi-Cue Framework for Online Exam Proctoring
8. Yanhang Li, Zhichao Fan, Zexin Zhuang
How Much Do We Lose? Quantifying the Impact of Face De-identification on Classroom Computer Vision Tasks
9. Yanhang Li, Zhichao Fan, Zexin Zhuang
Do We Need Faces? Privacy-Preserving Engagement Detection via Face-Free Features in Classroom Video
10. Siddharth Manne, Shaden Alshammari, Satish Somaraju
Socratic: Pushing the Boundaries of Interactive Visual Educational Content Creation
11. Videep Venkatesha, Ethan Seefried, Changsoo Jung, Nathaniel Blanchard
Modeling Epistemic Vigilance in Collaborative Problem Solving: A Multimodal Approach Using Social Deduction as a Proxy Testbed
12. Thai Quoc Hoang, Ran Xu
EduLens: A Self-Evolving AI Agent on Smart Glasses for Adaptive Teaching Support via Continuous Classroom Perception

Workshop Schedule - June 4 - Room 113

Opening and Goals
Keynotes 1 and 2
Poster Session @ Hall A
Coffee Break
Poster Session @ Hall A (cont.)
Keynotes 3 and 4
Panel and Community Discussion
Closing and Next Steps

Venue

Colorado Convention Center
700 14th Street
Denver, CO 80202

The workshop will be held in conjunction with CVPR 2026.

List of Sponsors

CV4Edu 2026 is made possible by these organizations.